Abstract

The MiTAP prototype for SARS detection uses human language
technology to detect, monitor, and analyze potential indicators
of infectious disease outbreaks, and expert system reasoning to
issue warnings and alerts. MiTAP focuses on providing timely,
multilingual information access to analysts, domain experts, and
decision-makers worldwide. Data sources are captured, filtered,
translated, summarized, and categorized by content. Critical
information is automatically extracted and tagged to facilitate
browsing, searching, and scanning, and to provide key terms at a
glance. The processed articles are made available through an
easy-to-use news server and cross-language information retrieval
system for access and analysis anywhere, any time. Specialized
newsgroups and customizable filters or searches on incoming
stories allow users to create their own view into the data, while
a variety of tools summarize, indicate trends, and provide alerts
to potentially relevant spikes of activity.


1  Background 

Potentially catastrophic biological events that threaten US
national security are steadily increasing in frequency. These
events pose immediate danger to animals, plants, and humans.
Current disease surveillance systems are inadequate for detecting
indicators early enough to ensure the rapid response needed to
combat these biological events and the corresponding public
reaction. Recent examples of outbreaks include both the HIV/AIDS
and foot-and-mouth disease pandemics, the spread of West Nile
virus to and across the US, the escape of Rift Valley Fever from
Africa, SARS, and the translocation of both mad cow disease (BSE)
and monkeypox to the United States.

Biological surveillance systems in the United States rely most
heavily on human medical data for signs of epidemic activity.
These systems span multiple organizations and agencies, are often
not integrated, and have no alerting capability. As a result,
responders have an insufficient amount of lead time to prepare
for biological events or catastrophes.

Indications and Warnings (I&Ws) provide the potential for early
alert of impending biological events, perhaps weeks to months in
advance. Sources of I&Ws include transportation data,
telecommunication traffic, economic indices, Internet news, RSS
feeds (RSS) including weblogs, commerce, agricultural
surveillance, weather, and other environmental data.
Retrospective analyses of major infectious disease outbreaks
(e.g., West Nile Virus and SARS) show that I&Ws were present
weeks to months in advance, but these indicators were missed
because data sources were difficult to obtain and hard to
integrate. As a result, the available information was not used
to mount an appropriate national and international response.
This highlights a critical need in biodefense for an integrated
system linking I&Ws for biological events from multiple and
disparate sources with the response community.

2  Introduction 

MiTAP (Damianos et al. 2002) was originally developed by the
MITRE Corporation under the Defense Advanced Research Projects
Agency (DARPA) Translingual Information Detection, Extraction,
and Summarization (TIDES) program. TIDES aims to revolutionize
the way that information is obtained from human language by
enabling people to find and interpret relevant information
quickly and effectively, regardless of language or medium. MiTAP
was initially created for tracking and monitoring infectious
disease outbreaks and other biological threats as part of a
DARPA Integrated Feasibility Experiment in biosecurity to explore
the integration of synergistic TIDES language processing
technologies applied to a real-world domain. The system has since
been expanded to other domains such as weapons of mass
destruction, satellite monitoring, and suspect terrorist
activity. In addition, researchers and analysts are examining
hundreds of MiTAP data sources for differing perspectives on
conflict and humanitarian relief efforts.

Our newest MiTAP prototype explores the integration of outputs
from operational data mining (anomaly detection), human language
technology (information extraction, temporal tagging, machine
translation, cross-language information retrieval), and
visualization tools to detect SARS-specific I&Ws in Asia, with
relevance to pathogen translocation to the United States. Using
feeds from English and Chinese language newswire, weblogs, and
other Internet data, the system translates Chinese text data and
tracks keyword combinations thought to represent I&Ws specific to
SARS outbreaks in China. Analysts can use cross-language
information retrieval for retrospective analysis and to improve
the I&W model, save searches to use as filters on incoming data,
view trends, and visualize the data along a timeline. Figure 1
shows an overview of the prototype.

Warnings generated by this MiTAP prototype are intended to
complement traditional biosurveillance and communications already
in use by the international public health community. This system
represents an expansion of current US surveillance capabilities
to detect biological agents of catastrophic potential.

Figure 1. Overview of the MiTAP prototype for SARS detection.


3  Component  Technologies 

The MiTAP prototype relies extensively on human language
technology and expert system reasoning. Below, MiTAP capabilities
are described briefly along with their contributing component
technologies.

3.1  Information  Processing 

After Internet news sources are captured and normalized, they are
passed through a zoner that uses human-generated rules to
identify source, date, and other fields such as headline (or
title) and content. The Alembic natural language analyzer
(Aberdeen et al. 1995; Vilain and Day 1996) processes the zoned
messages to identify paragraph, sentence, and word boundaries as
well as part-of-speech tags. The messages then pass through the
Alembic named entity recognizer for identification and tagging of
person, organization, location, and disease names. Finally, the
article is processed by the TempEx normalizing time expression
tagger (Mani and Wilson 2000).
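
As a rough illustration of the zoning step only, the sketch below
applies hand-written patterns to a normalized message to pull out
source, date, headline, and content. The header labels, the regular
expressions, and the function name are assumptions made for this
example; the actual MiTAP zoner rules are not given here.

```python
import re

# Hypothetical zoning rules: one pattern per header field.
ZONE_RULES = {
    "source": re.compile(r"^Source:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(.+)$", re.MULTILINE),
    "headline": re.compile(r"^(?:Headline|Title):\s*(.+)$", re.MULTILINE),
}


def zone_message(text):
    """Split a normalized message into header zones and body content.

    Assumes (for illustration) that headers and body are separated
    by the first blank line.
    """
    header, _, body = text.partition("\n\n")
    zones = {}
    for name, pattern in ZONE_RULES.items():
        match = pattern.search(header)
        zones[name] = match.group(1).strip() if match else None
    zones["content"] = body.strip()
    return zones
```

Fields that no rule matches are left as None, so downstream
components can tell a missing zone from an empty one.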

For Chinese and other non-English sources, the CyberTrans machine
translation system (Miller et al. 2001) is used to translate
articles automatically into English. CyberTrans wraps commercial
and research translation engines to present a common set of
interfaces; the current prototype makes use of the SYSTRAN
Chinese-English system.
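
The wrapping idea can be sketched as a small adapter layer. This is
not the CyberTrans API: the class names, the register/translate
methods, and the stand-in engine below are all hypothetical and
only show how several engines might sit behind one interface.

```python
from abc import ABC, abstractmethod


class TranslationEngine(ABC):
    """Common interface that every wrapped engine must implement."""

    @abstractmethod
    def translate(self, text: str) -> str:
        ...


class UppercaseStubEngine(TranslationEngine):
    """Stand-in engine for the example; it does not really translate."""

    def translate(self, text: str) -> str:
        return text.upper()


class TranslationRouter:
    """Dispatches a request to the engine registered for a language pair."""

    def __init__(self):
        self._engines = {}

    def register(self, source_lang, target_lang, engine):
        self._engines[(source_lang, target_lang)] = engine

    def translate(self, text, source_lang, target_lang="en"):
        engine = self._engines.get((source_lang, target_lang))
        if engine is None:
            raise LookupError(f"no engine for {source_lang}->{target_lang}")
        return engine.translate(text)
```

Callers depend only on the router, so an engine for a given
language pair can be swapped without changing the rest of the
pipeline.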

RSS feeds can provide a high-volume textual gestalt. Weblogs, in
particular, are a good source of timely text, some of which is
topical and all of which is based on personal observations and
experiences. Aggregate measurements on these feeds can provide
indications of public health-related phenomena. Consider the
relative rates of words and phrases such as "stay home from" or
"pneumonia." Locating non-seasonal spikes in the relative rank of
these strings in space and time can flag them for further
investigation by I&W experts.
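
A minimal sketch of this kind of aggregate measurement follows. It
assumes posts are already grouped by day, uses the share of posts
containing a phrase as the relative rate, and flags a day whose
rate stands well above a trailing baseline. The window, the
multiplier k, and the minimum rise are illustrative assumptions,
and the sketch tracks rates rather than ranks and makes no attempt
to remove seasonal effects or to localize spikes geographically.

```python
from statistics import mean, pstdev


def relative_rates(daily_posts, phrase):
    """Fraction of each day's posts that contain the phrase.

    daily_posts: list of lists of post texts, one inner list per day.
    """
    phrase = phrase.lower()
    rates = []
    for posts in daily_posts:
        hits = sum(phrase in post.lower() for post in posts)
        rates.append(hits / len(posts) if posts else 0.0)
    return rates


def find_spikes(rates, window=7, k=3.0, min_rise=0.05):
    """Indices of days whose rate exceeds the trailing baseline.

    A day is flagged when its rate is more than k standard deviations
    above the mean of the previous `window` days and also at least
    `min_rise` above that mean (guards against a flat baseline).
    """
    spikes = []
    for i in range(window, len(rates)):
        baseline = rates[i - window:i]
        mu, sigma = mean(baseline), pstdev(baseline)
        if rates[i] > mu + k * sigma and rates[i] - mu >= min_rise:
            spikes.append(i)
    return spikes
```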

3.2  Browsing 

English language data and pairs of foreign language documents and
their translated versions are made available on a news server
(INN 2001) for browsing. The system categorizes and bins articles
into newsgroups based on their content. To do this, the system
relies on a combination of information extraction results and
human-generated rules for pattern matching. Newsgroups are
created to provide multiple perspectives on the data; analysts
can subscribe to specific disease tracking newsgroups, regional
newsgroups, specific data source newsgroups, or to customized
topic tracking newsgroups that may be based on several related
subjects.
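
The binning step might look roughly like the sketch below, which
combines extracted entities with a text pattern. The newsgroup
names, the article fields, and the rules themselves are
hypothetical; they stand in for rules that are not listed in this
paper.

```python
import re

QUARANTINE = re.compile(r"\bquarantin\w*", re.IGNORECASE)

# Hypothetical rules: (newsgroup name, predicate over an article dict).
NEWSGROUP_RULES = [
    ("disease.sars", lambda a: "SARS" in a["diseases"]),
    ("region.china", lambda a: "China" in a["locations"]),
    ("topic.quarantine", lambda a: QUARANTINE.search(a["text"]) is not None),
]


def assign_newsgroups(article):
    """Return every newsgroup whose rule matches the article.

    An article may land in several groups, which gives analysts
    multiple views of the same data.
    """
    groups = [name for name, rule in NEWSGROUP_RULES if rule(article)]
    # One newsgroup per data source, derived from the zoned source field.
    groups.append("source." + article["source"].lower().replace(" ", "-"))
    return groups
```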

Tagged entities in each article are color-coded to enable rapid
scanning of information and easy identification of key names. The
five most frequently mentioned locations in each article as well
as the top five people are presented as a list for quick
reference.
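
Given the tagger output, the quick-reference lists reduce to a
frequency count per entity type. The sketch assumes the tagged
entities arrive as (type, text) pairs, which is an assumption about
the data shape rather than the Alembic output format.

```python
from collections import Counter


def top_entities(tagged_entities, entity_type, n=5):
    """Most frequently mentioned entities of one type in an article.

    tagged_entities: iterable of (entity_type, entity_text) pairs.
    """
    counts = Counter(
        text for etype, text in tagged_entities if etype == entity_type
    )
    return [text for text, _ in counts.most_common(n)]
```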

3.3  Information  Retrieval 

To supplement access to the articles on the news server and to
allow for retrospective analysis, articles are indexed using the
Lucene information retrieval system (The Jakarta Project 2001)
for English language documents and using PSE (Darwish 2002) for
foreign language documents. Web links are maintained between
foreign language documents and their translated versions to allow
for more accurate human translations of selected documents.

Analysts can perform full text, source-specific queries over the
entire set of archived documents and view the retrieved results
as a relevance-ranked list or as a plot across a timeline. A
cross-language information retrieval interface allows users to
search in English across the Chinese language sources.

Users can also save specific search constraints to be used as
filters on incoming data. These saved searches provide a simple
analytic capability as well as an alerting feature. (See below.)
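
One way to picture a saved search acting as a filter is the sketch
below. The SavedSearch fields and the simple all-terms matching are
assumptions for illustration; the real system evaluates saved
queries with its retrieval engines rather than substring tests.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SavedSearch:
    """Hypothetical saved search: required terms plus an optional source."""

    name: str
    terms: Tuple[str, ...]
    source: Optional[str] = None

    def matches(self, article):
        if self.source is not None and article["source"] != self.source:
            return False
        text = article["text"].lower()
        # Every term must occur somewhere in the article text.
        return all(term.lower() in text for term in self.terms)


def filter_incoming(articles, search):
    """Keep only the incoming articles that satisfy the saved search."""
    return [a for a in articles if search.matches(a)]
```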

3.4  Analysis 

To assist analysts in identifying relevant and related articles,
we have integrated multi-document summarization and watch lists.
Columbia University's Newsblaster (McKeown et al. 2002)
automatically detects daily topics, clusters MiTAP articles
around those topics, and generates multi-document summaries,
which are made available on the news server. Multiple
technologies (e.g., coreference, information extraction) from
Alias I, Inc. (Baldwin et al. 2002) produce comprehensive views
on specific named entities (i.e., people or diseases) across
MiTAP documents. These views are summarized through ranked lists,
highlighting important topics of the day and activities which
might indicate disease outbreak.

Finely-tuned searches can be saved and applied as filters or
topic tracking mechanisms. These saved searches are automatically
updated at specific intervals and can be aggregated and displayed
visually as bar graphs to reveal spikes of activity that
otherwise might go undetected.

3.5  Alerting 

The MiTAP prototype has two separate alerting capabilities: saved
searches and an integrated expert system. The saved search
functionality allows analysts to set thresholds for alerting
purposes. For example, MiTAP can send email when any new article
arrives, when the number of new articles reaches a specified
maximum, or when the daily number of new articles increases by
some percentage of the total or moving average.
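
The three threshold styles can be sketched as a single check over
daily counts of matching articles. The mode names, the seven-day
window, and the exact percentage formula are assumptions; this
paper does not specify how MiTAP computes them, and the sketch
covers only the moving-average variant of the percentage rule.

```python
def should_alert(daily_counts, mode, threshold=None, window=7):
    """Decide whether today's article count should trigger an alert.

    daily_counts: new matching articles per day, oldest first;
    the last element is today.
    """
    today = daily_counts[-1]
    if mode == "any":
        return today > 0
    if mode == "count":
        return today >= threshold
    if mode == "percent_over_average":
        history = daily_counts[-(window + 1):-1]  # up to `window` prior days
        if not history:
            return False
        average = sum(history) / len(history)
        if average == 0:
            return today > 0
        return (today - average) / average * 100 >= threshold
    raise ValueError(f"unknown alert mode: {mode}")
```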

The Human Language Indication Detector (HLID) performs data
fusion on a number of disparate sources, compressing a large
volume of information into a smaller but more significant set of
alerts. HLID monitors a variety of sources including MiTAP
articles, information events in RSS feeds, and other dynamically
updated information on the World Wide Web. HLID analyzes events
from these sources in real time and generates an estimate of
significance for each, complete with an audit trail of supporting
and negating evidence. This allows an analyst to direct a search
for indicators towards interesting data while reducing the time
spent investigating false alarms and insignificant events.

HLID is composed of four major components. The first is an event
collector, which monitors a data source and triggers action when
an event is observed. These events are sent to the rule-based
reasoning engine, an expert system shell (JESS 2004) with
hand-authored rules. The engine performs vetting and initial
investigation of each event by identifying correlated events,
corroborating or invalidating evidence, and references to
supporting information. The engine can also supplement its
knowledge base by performing a directed search via the query
management system, which allows retrieval of information from a
wide variety of sources including databases and web pages.
Lastly, the alerting mechanism disseminates the conclusions
reached by the system and provides an interface that allows an
analyst to launch a deeper search for indicators and warnings.
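
The flow from collected event to alert can be pictured with the
plain-Python sketch below. It does not use JESS, and the keywords,
rule weights, and alert threshold are invented for illustration.
The directed search through the query management system is left
out; only scoring with an audit trail and the final alert cut are
shown.

```python
from dataclasses import dataclass, field

KEYWORDS = {"pneumonia", "quarantine"}  # hypothetical indicator terms


@dataclass
class Event:
    source: str
    text: str
    significance: float = 0.0
    audit_trail: list = field(default_factory=list)


def keywords_in(text):
    return {k for k in KEYWORDS if k in text.lower()}


def corroborated(event, knowledge_base):
    """True if an earlier event from another source shares a keyword."""
    mine = keywords_in(event.text)
    return any(
        other.source != event.source and bool(mine & keywords_in(other.text))
        for other in knowledge_base
    )


# Hypothetical hand-authored rules: (description, predicate, weight).
RULES = [
    ("indicator keyword present", lambda e, kb: bool(keywords_in(e.text)), 0.4),
    ("corroborated by another source", corroborated, 0.4),
    ("official denial reported", lambda e, kb: "denied" in e.text.lower(), -0.3),
]


def evaluate(event, knowledge_base):
    """Score an event, record why, then add it to the knowledge base."""
    for description, predicate, weight in RULES:
        if predicate(event, knowledge_base):
            event.significance += weight
            event.audit_trail.append((description, weight))
    knowledge_base.append(event)
    return event


def select_alerts(events, threshold=0.5):
    """Keep only events significant enough to send to an analyst."""
    return [e for e in events if e.significance >= threshold]
```

Positive and negative weights in the audit trail correspond to
supporting and negating evidence, so an analyst can see why an
event was or was not raised.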